Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add grok and dissect methods to runtime fields #68088

Merged
merged 4 commits into from
Feb 1, 2021

Conversation

nik9000
Copy link
Member

@nik9000 nik9000 commented Jan 27, 2021

This adds a grok and a dissect method to runtime fields which
returns a Matcher style object you can use to get the matched
patterns. A fairly simple script to extract the "verb" from an apache
log line with grok would look like this:

String verb = grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.verb;
if (verb != null) {
  emit(verb);
}

And dissect would look like:

String verb = dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{status} %{size}').extract(doc["message"].value)?.verb;
if (verb != null) {
  emit(verb);
}

We'll work later to get it down to a clean looking one liner, but for
now, this'll do.

The grok and dissect methods are special in that they only run at
script compile time. You can't pass non-constants to them. They'll
produce compile errors if you send in a bad pattern. This is nice
because they can be expensive to "compile" and there are many other
optimizations we can make when the patterns are available up front.

Closes #67825

@nik9000 nik9000 added >enhancement :Search/Search Search-related issues that do not fall into other categories v8.0.0 v7.12.0 labels Jan 27, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Jan 27, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

This adds a `grok` and a `dissect` method to runtime fields which
returns a `Matcher` style object you can use to get the matched
patterns. A fairly simple script to extract the "verb" from an apache
log line with `grok` would look like this:
```
String verb = grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.verb;
if (verb != null) {
  emit(verb);
}
```

And `dissect` would look like:
```
String verb = dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{status} %{size}').extract(doc["message"].value)?.verb;
if (verb != null) {
  emit(verb);
}
```

We'll work later to get it down to a clean looking one liner, but for
now, this'll do.

The `grok` and `dissect` methods are special in that they only run at
script compile time. You can't pass non-constants to them. They'll
produce compile errors if you send in a bad pattern. This is nice
because they can be expensive to "compile" and there are many other
optimizations we can make when the patterns are available up front.

Closes elastic#67825
/**
* A simple pattern that we can implement against grok and dissect.
*/
public interface SimplePattern {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of us is particularly happy with this name. Sure, right now it is like java's Pattern class but with only a single method so, sure, it is "simple" now. But "Simple" just sounds like a bad class name. AbstractingPattern? NamedMatchesPattern? GrokOrDissectPattern? FlavoredPattern? These sound kind of bad to me too. But maybe better than SimplePattern?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If anyone can come up with a better name for this thing then I'm all ears.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rjernst Pinging for a name :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could use something other than Pattern, since this doesn't actually directly have anything to do with java regexes. In essence it is a functional interface to take a string, and return a map of named substrings. Maybe NamedGroupExtractor?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NamedGroupExtractor will do just fine. We can always change it later because folks won't compile against it. I hope.

@@ -824,6 +831,48 @@ public void visitInvokeCallMember(InvokeCallMemberNode irInvokeCallMemberNode, C
int j = i;
irInvokeCallMemberNode.getArgumentNodes().get(i).visit(this, (e) -> irInvokeCallMemberNode.getArgumentNodes().set(j, e));
}
PainlessMethod method = irInvokeCallMemberNode.getDecorationValue(IRDMethod.class);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the bit that runs the method at compile time if it has the annotation. We only implemented this for statically imported methods defined like from_class or defined with instance binding because that is all we needed. We did not implement this for regular methods or augments or bound_to declarations. All of those throw an error when you try to load the PainlessLookup to warn folks that we never implemented this feature for them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an issue that will allow annotations to determine if they're allowed on a certain whitelist type (#68322).

@@ -1220,7 +1221,13 @@ public void visitConstant(ConstantNode irConstantNode, WriteScope writeScope) {
else if (constant instanceof Byte) methodWriter.push((byte)constant);
else if (constant instanceof Boolean) methodWriter.push((boolean)constant);
else {
throw new IllegalStateException("unexpected constant [" + constant + "]");
/*
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bit relies on the new compiler phase below.


import java.lang.reflect.Modifier;

/**
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the secret magic of this PR: if you define a ConstantNode you used to have to attach a constant that'd fit in the "constant pool". See the instanceof tree below for what fits. This finds what doesn't fit and causes us to inject the values when we build the script. This is a trick we can do because painless is always embedded in a running process so we can just build what we need at compile time and stick it on the script objects we build. javac has to write class files and it wouldn't have a place to "put" that state. We're lucky to be embedded, making this fairly easy for us.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, but would it make sense to combine this into the constant folding phase rather than have a separate phase for it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to run this "after" all of the constants have resolved, just in case some constant doesn't fit in the constant pool but we end up removing it during the constant folding phase.

@@ -131,7 +131,7 @@ public void testNoArgs() throws Exception {
scriptEngine.compile("testNoArgs3", "_score", NoArgs.CONTEXT, emptyMap()));
assertEquals("cannot resolve symbol [_score]", e.getMessage());

String debug = Debugger.toString(NoArgs.class, "int i = 0", new CompilerSettings());
String debug = Debugger.toString(NoArgs.class, "int i = 0", new CompilerSettings(), Whitelist.BASE_WHITELISTS);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method can take a List<Whitelist> now so I can test the output using the fancy binding changes. To make sure the constants really get inlined.

@@ -15,6 +15,8 @@ tasks.withType(JavaCompile).configureEach {
dependencies {
compileOnly project(":server")
compileOnly project(':modules:lang-painless:spi')
api project(':libs:elasticsearch-grok')
api project(':libs:elasticsearch-dissect')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I ever know the difference between api and compileOnly. I just copied these two lines from ingest-common.

"runtime_fields.grok.watchdog.max_execution_time",
TimeValue.timeValueSeconds(1),
Setting.Property.NodeScope
);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In ingest these are node level settings so I made them node level settings. That might have been an accident of history. Runtime fields usually run inside of a search context so we could maybe do better than the watchdog that we have, but that seems like a problem for another time.

import java.util.Map;

public abstract class BooleanFieldScript extends AbstractFieldScript {

public static final ScriptContext<Factory> CONTEXT = newContext("boolean_script_field", Factory.class);

static List<Whitelist> whitelist() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed so we can plumb grok and dissect into place one time.

nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jan 29, 2021
This creates `.emit()` methods on a few primitive things for `keyword`
and `date` style runtime fields. Now you can do stuff like:
```
"d": {
  "type": "date",
  "script": "'2020-01-18T17:41:34.000Z'.emit()"
}
```

We get to piggy back of painless's dynamic dispatch code to regonize
that you are in a `date` and trying to emit a `String` so we can "do the
right thing" without any extra runtime cost. In this case we just parse
the date using the `formatter` on the field. And since the example above
doesn't use a formatter we get ISO8601, the date format of kings.

Similarly, you can do stuff like:
```
"s": {
  "type": "keyword",
  "script": """
    for (int i = 0; i < 100; i++) {
      i.emit();
    }
  """
}
```

Painless *knows* `i` is an `int` and will call an emit method that emits
its string value.

Also! Assuming we get the syntax proposed in elastic#68088, because this is a
chain of method invocations you can do something like:
```
"i": {
  "type": "long",
  "script": """
    grok('%{NUMBER:i} %{NUMBER:j}').extract(doc['message'].value)?.i?.emit()
  """
}
```

This should be read as "if grok matches the message and extracts a value
for `i` then emit it to the runtime field." If either the grok doesn't
match or doesn't extract an `i` value then nothing will be emitted. As
an extra nice thing - we'll automatilly convert whatever the `grok`
expression outputs into a `long`. All of it handled for us using
painless's standard dynamic dispatch code.
Copy link
Contributor

@jdconrad jdconrad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for adding this! Very cool change. Added some thoughts for possible clean up/future PRs, but nothing that I would consider necessary for this PR.


import java.util.Map;

public class CompileTimeOnlyAnnotationParser implements WhitelistAnnotationParser {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we haven't been great about this on previous annotations, but would you please add sentence or two on what this one does as a class-style JavaDoc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+++

@@ -393,6 +395,10 @@ public void addPainlessConstructor(Class<?> targetClass, List<Class<?>> typePara
"[[" + targetCanonicalClassName + "], " + typesToCanonicalTypeNames(typeParameters) + "]", iae);
}

if (annotations.containsKey(CompileTimeOnlyAnnotation.class)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily for this PR - I wonder if it would make sense to allow annotations to know what they're applied to as part of the annotation parser so we could reject invalid types there where type is constructor, method, field, binding, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. As we add more of these this kind of one off stuff isn't going to be nice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an issue (#68322)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for (int i = 0; i < irInvokeCallMemberNode.getArgumentNodes().size(); i++) {
ExpressionNode argNode = irInvokeCallMemberNode.getArgumentNodes().get(i);
if (argNode instanceof ConstantNode == false) {
// TODO find a better string to output
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget the TODO here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got any ideas?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add the text for the non-constant, and the name (I think it's available here as part of the node?) of the method we're trying to call?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracking with #68324 so we don't forget because we don't have any good ideas now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I filed the issue I'll do it in a follow up.

}
}

private void replaceCallWithConstant(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this utility method is worth it since it's only called once. I prefer for now that the visit methods all kind of be lumped together.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's called twice in the method above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, never mind!


import java.lang.reflect.Modifier;

/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, but would it make sense to combine this into the constant folding phase rather than have a separate phase for it?

@nik9000 nik9000 merged commit 419ce10 into elastic:master Feb 1, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Feb 1, 2021
…8088)

This adds a `grok` and a `dissect` method to runtime fields which
returns a `Matcher` style object you can use to get the matched
patterns. A fairly simple script to extract the "verb" from an apache
log line with `grok` would look like this:
```
String verb = grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.verb;
if (verb != null) {
  emit(verb);
}
```

And `dissect` would look like:
```
String verb = dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{status} %{size}').extract(doc["message"].value)?.verb;
if (verb != null) {
  emit(verb);
}
```

We'll work later to get it down to a clean looking one liner, but for
now, this'll do.

The `grok` and `dissect` methods are special in that they only run at
script compile time. You can't pass non-constants to them. They'll
produce compile errors if you send in a bad pattern. This is nice
because they can be expensive to "compile" and there are many other
optimizations we can make when the patterns are available up front.

Closes elastic#67825
nik9000 added a commit that referenced this pull request Feb 1, 2021
…68332)

This adds a `grok` and a `dissect` method to runtime fields which
returns a `Matcher` style object you can use to get the matched
patterns. A fairly simple script to extract the "verb" from an apache
log line with `grok` would look like this:
```
String verb = grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.verb;
if (verb != null) {
  emit(verb);
}
```

And `dissect` would look like:
```
String verb = dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{status} %{size}').extract(doc["message"].value)?.verb;
if (verb != null) {
  emit(verb);
}
```

We'll work later to get it down to a clean looking one liner, but for
now, this'll do.

The `grok` and `dissect` methods are special in that they only run at
script compile time. You can't pass non-constants to them. They'll
produce compile errors if you send in a bad pattern. This is nice
because they can be expensive to "compile" and there are many other
optimizations we can make when the patterns are available up front.

Closes #67825
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Feb 3, 2021
Replaces the double `Pattern.compile` invocations in painless scripts
with the fancy constant injection we added in elastic#68088. This caused one of
the tests to fail. It turns out that we weren't fully iterating the IR
tree during the constant folding phases. I started experimenting and
added a ton of tests that failed. Then I fixed them by changing the IR
tree walking code.
nik9000 added a commit that referenced this pull request Feb 3, 2021
Replaces the double `Pattern.compile` invocations in painless scripts
with the fancy constant injection we added in #68088. This caused one of
the tests to fail. It turns out that we weren't fully iterating the IR
tree during the constant folding phases. I started experimenting and
added a ton of tests that failed. Then I fixed them by changing the IR
tree walking code.
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Feb 3, 2021
Replaces the double `Pattern.compile` invocations in painless scripts
with the fancy constant injection we added in elastic#68088. This caused one of
the tests to fail. It turns out that we weren't fully iterating the IR
tree during the constant folding phases. I started experimenting and
added a ton of tests that failed. Then I fixed them by changing the IR
tree walking code.
nik9000 added a commit that referenced this pull request Feb 4, 2021
Replaces the double `Pattern.compile` invocations in painless scripts
with the fancy constant injection we added in #68088. This caused one of
the tests to fail. It turns out that we weren't fully iterating the IR
tree during the constant folding phases. I started experimenting and
added a ton of tests that failed. Then I fixed them by changing the IR
tree walking code.
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Feb 4, 2021
As of elastic#68088 painless can have methods where all parameters must be
constant. This improves the error message when the parameter isn't. It's
still not super great, but its better and its what we can easilly give
at that point in the compiler.
nik9000 added a commit that referenced this pull request Feb 4, 2021
As of #68088 painless can have methods where all parameters must be
constant. This improves the error message when the parameter isn't. It's
still not super great, but its better and its what we can easilly give
at that point in the compiler.
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Feb 4, 2021
As of elastic#68088 painless can have methods where all parameters must be
constant. This improves the error message when the parameter isn't. It's
still not super great, but its better and its what we can easilly give
at that point in the compiler.
nik9000 added a commit that referenced this pull request Feb 4, 2021
As of #68088 painless can have methods where all parameters must be
constant. This improves the error message when the parameter isn't. It's
still not super great, but its better and its what we can easilly give
at that point in the compiler.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v7.12.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants